{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fairness Linear Learner  in SageMaker\n",
    "\n",
    "1. [Introduction](#Introduction)\n",
    "    1. [Fairness definition](#fairness)\n",
    "2. [Prerequisites and Data](#pre_and_data)\n",
    "    1. [Initialize SageMaker](#initsagemaker)\n",
    "    2. [Download data](#download_data)\n",
    "    3. [Loading the data: Adult Dataset](#load_data) \n",
    "    4. [Data inspection](#inspect_data) \n",
    "    5. [Data encoding](#encode_data) \n",
    "    6. [Data conversion and upload of the training data](#upload_data) \n",
    "3. [Training the standard linear model](#train_linear_model)\n",
    "    1. [Accuracy and Fairness of the model](#performance_linear_model)\n",
    "4. [Changing the data to impose fairness](#impose_fairness)\n",
    "    1. [Train the model with the fair data](#train_fair_model)\n",
    "    2. [Accuracy and Fairness of the fair model](#performance_fair_model)\n",
    "    3. [Sanity check: performance on the training set](#performance_fair_model_train)\n",
    "5. [Distribution of the outputs](#distrib)\n",
    "    \n",
    "\n",
    "\n",
    "## Introduction <a class=\"anchor\" id=\"Introduction\">\n",
    "There have recently been concerns about bias in machine learning algorithms as a result of mimicking existing human prejudices. Nowadays, several Machine Learning methods have strong social implications, for example they are used to predict bank loans, insurance rates or advertising. Unfortunately, an algorithm that learns from historical data will naturally inherit the past biases. In this notebook, we present how to overcome this problem by using SageMaker and Fair Algorithms in the context of Linear Learners.\n",
    "    \n",
    "We will start by introducing some of the concepts and math behind fairness, then we will get ourselves setup to use these concepts in SageMaker, download data, train a model, and finally apply our fairness concepts to adjust our model predictions appropriately."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fairness definition <a class=\"anchor\" id=\"fairness\">\n",
    "We introduce here a classic measure of Fairness. Giving a dataset, we can consider a binary sensitive feature (for example, gender between male and female). We show here definition of Equal Opportunity$^{[1]}$ among these two groups $A$ (female) and $B$ (male) and a trained model $f$:\n",
    "\n",
    "$$\\mathbb{P}_{(x,y)} \\big[ f(x)>0 \\, \\big| \\, x \\text{ in group } A, y = 1 \\big] = \\mathbb{P}_{(x,y)} \\big[ f(x)>0 \\, \\big| \\, x \\text{ in group } B, y = 1 \\big],$$\n",
    "\n",
    "where $\\mathbb{P}_{(x,y)}$ is the probability with respect to all the possible examples $(x,y)$.\n",
    "\n",
    "Practically, we are imposing the same True Positive Rate (TPR) among the two groups. Starting from this definition, we can estimate the Difference in Equal Opportunity (DEO) of the model $f$ as:\n",
    "\n",
    "$$DEO(f) = \\Big| \\mathbb{P}_{(x,y)}\\big[ f(x)>0 \\, \\big| \\, x \\text{ in group } A, y = 1 \\big] -  \\mathbb{P}_{(x,y)} \\big[ f(x)>0 \\, \\big| \\, x \\text{ in group } B, y = 1 \\big] \\Big|.$$\n",
    "\n",
    "A low value of this measure means a fair model $f$ with respect to the sensitive feature.\n",
    "\n",
    "Now, we introduce a method to empirically evalute Accuracy, TPR and DEO from a list of predictions. The input of this method are: dataset, list of predictions obtained by using a machine learning model and indices of the exmaple in group $A$ and group $B$.\n",
    "\n",
    "$^{[1]}$Moritz Hardt, Eric Price, and Nati Srebro. \"[Equality of opportunity in supervised learning](http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf)\". Advances in neural information processing systems (2016)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "def deo_from_list(dataset, predictions, groupA_idxs, groupB_idxs):\n",
    "    tnA, fpA, fnA, tpA  = confusion_matrix(np.where(dataset[groupA_idxs][:,-1] == 1, 1, 0),\n",
    "                           predictions[groupA_idxs]).ravel()\n",
    "    tnB, fpB, fnB, tpB  = confusion_matrix(np.where(dataset[groupB_idxs][:,-1] == 1, 1, 0),\n",
    "                                           predictions[groupB_idxs]).ravel()\n",
    "\n",
    "    print('Examples in group A: %d' % len(groupA_idxs))\n",
    "    print('Examples in group B: %d' % len(groupB_idxs))\n",
    "\n",
    "    print('TPR group A: %f' % (float(tpA) / (tpA + fnA)))\n",
    "    print('TPR group B: %f' % (float(tpB) / (tpB + fnB)))\n",
    "    print('TPR all dataset: %f' % (float(tpA + tpB) / (tpA + fnA + tpB + fnB)))\n",
    "    print('Accuracy all dataset: %f' % (float(tpA + tnA + tpB + tnB) /\n",
    "          (tpA + fpA + fnA + tnA + tpB + fpB + fnB + tnB)))\n",
    "    \n",
    "    return(np.abs(float(tpA) / (tpA + fnA) - float(tpB) / (tpB + fnB)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is usefull to have a method to evaluate Accuracy, TPR and DEO directly from a model. In this case the method requries as input: our trained model $f$, dataset and index of the sensitive feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def deo_from_model(model, dataset, sensitive_feature, groupA_value=None):\n",
    "    if groupA_value == None:\n",
    "        groupA_value = np.max(dataset[:, sensitive_feature])\n",
    "        \n",
    "    groupA_idxs = [idx for idx, val in enumerate(dataset)\n",
    "                   if val[sensitive_feature] == groupA_value]\n",
    "    groupB_idxs = [idx for idx, val in enumerate(dataset)\n",
    "                   if val[sensitive_feature] != groupA_value]\n",
    "    \n",
    "    predictions = []\n",
    "    for array in np.array_split(dataset[:,:-1], 100):\n",
    "        result = model.predict(array)\n",
    "        predictions += [r['predicted_label'] for r in result['predictions']]\n",
    "    \n",
    "    predictions = np.array(predictions)\n",
    "    \n",
    "    return deo_from_list(dataset, predictions, groupA_idxs, groupB_idxs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites and Data <a class=\"anchor\" id=\"pre_and_data\">\n",
    "### Initialize SageMaker  <a class=\"anchor\" id=\"initsagemaker\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker import Session\n",
    "bucket = Session().default_bucket() #'fairness-test2'\n",
    "prefix = 'sagemaker/DEMO-linear-adult'\n",
    "\n",
    "# Define IAM role\n",
    "from sagemaker import get_execution_role\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import urllib\n",
    "import os\n",
    "import sklearn.preprocessing as preprocessing\n",
    "import seaborn as sns\n",
    "\n",
    "role = get_execution_role()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download data <a class=\"anchor\" id=\"download_data\">    \n",
    "Data Source: [https://archive.ics.uci.edu/ml/machine-learning-databases/adult/](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/)\n",
    "\n",
    "Let's __download__ the data and save it in the local folder with the name adult.data and adult.test from UCI repository$^{[2]}$.\n",
    "\n",
    "$^{[2]}$Dua Dheeru, and Efi Karra Taniskidou. \"[UCI Machine Learning Repository](http://archive.ics.uci.edu/ml)\". Irvine, CA: University of California, School of Information and Computer Science (2017)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.isfile('adult.data'):\n",
    "    urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',\n",
    "                              'adult.data')\n",
    "    print('adult.data saved!')\n",
    "else:\n",
    "    print('adult.data already here.')\n",
    "\n",
    "if not os.path.isfile('adult.test'):\n",
    "    urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test',\n",
    "                              'adult.test')\n",
    "    print('adult.test saved!')\n",
    "else:\n",
    "    print('adult.test already here.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the data: Adult Dataset <a class=\"anchor\" id=\"load_data\">\n",
    "From UCI repository, this database contains 14 features concerning demographic characteristics of $45222$ instances ($32561$ for training and $12661$ for test). The task is to predict if a person has an income per year that is more (or less) than $50000\\,\\$$.\n",
    "\n",
    "Here the list of the features and their possible values:\n",
    "- (1) **Age**: continuous.\n",
    "- (2) **Workclass**: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.\n",
    "- (3) **Fnlwgt**: continuous (the number of people the census takers believe that observation represents).\n",
    "- (4) **Education**: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.\n",
    "- (5) **Education-num**: continuous.\n",
    "- (6) **Marital-status**: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.\n",
    "- (7) **Occupation**: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.\n",
    "- (8) **Relationship**: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.\n",
    "- (9) **Ethnic group**: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.\n",
    "- (10) **Gender**: Female, Male.\n",
    "- (11) **Capital-gain**: continuous.\n",
    "- (12) **Capital-loss**: continuous.\n",
    "- (13) **Hours-per-week**: continuous.\n",
    "- (14) **Native-country**: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.\n",
    "\n",
    "And finally our binary prediction task:\n",
    "- (15) **Target**: <=50, >50."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "original_data = pd.read_csv(\n",
    "    \"adult.data\",\n",
    "    names=[\n",
    "        \"Age\", \"Workclass\", \"fnlwgt\", \"Education\", \"Education-Num\", \"Martial Status\",\n",
    "        \"Occupation\", \"Relationship\", \"Ethnic group\", \"Sex\", \"Capital Gain\", \"Capital Loss\",\n",
    "        \"Hours per week\", \"Country\", \"Target\"],\n",
    "        sep=r'\\s*,\\s*',\n",
    "        engine='python',\n",
    "        na_values=\"?\")\n",
    "original_test = pd.read_csv(\n",
    "    \"adult.test\",\n",
    "    names=[\n",
    "        \"Age\", \"Workclass\", \"fnlwgt\", \"Education\", \"Education-Num\", \"Martial Status\",\n",
    "        \"Occupation\", \"Relationship\", \"Ethnic group\", \"Sex\", \"Capital Gain\", \"Capital Loss\",\n",
    "        \"Hours per week\", \"Country\", \"Target\"],\n",
    "        sep=r'\\s*,\\s*',\n",
    "        engine='python',\n",
    "        na_values=\"?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data inspection  <a class=\"anchor\" id=\"inspect_data\">\n",
    "Plotting histograms of the distribution of the different features is a good way to visualize the data. We plot both the whole dataset distributions (left) and the distrubtions in the case of positive labels only (right)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from math import ceil\n",
    "\n",
    "positive_idxs = [idx for idx, val in enumerate(original_data['Target']) if val == \">50K\"]\n",
    "\n",
    "fig = plt.figure(figsize=(20,100))\n",
    "cols = 2\n",
    "rows = ceil(float(original_data.shape[1]) / cols) * 2\n",
    "for i, column in enumerate(original_data.columns):\n",
    "    ax = fig.add_subplot(rows, cols, 2 * i + 1)\n",
    "    ax.set_title(column)\n",
    "    if original_data.dtypes[column] == np.object:\n",
    "        original_data[column][:].value_counts(sort=True).plot(kind=\"bar\", axes=ax)\n",
    "    else:\n",
    "        original_data[column][:].hist(axes=ax)\n",
    "        plt.xticks(rotation=\"vertical\")\n",
    "        \n",
    "    ax = fig.add_subplot(rows, cols, 2 * i + 2)\n",
    "    ax.set_title(column + \" (only positive examples)\")\n",
    "    if original_data.dtypes[column] == np.object:\n",
    "        original_data[column][positive_idxs].value_counts(sort=True).plot(kind=\"bar\", axes=ax)\n",
    "    else:\n",
    "        original_data[column][positive_idxs].hist(axes=ax)\n",
    "        plt.xticks(rotation=\"vertical\")\n",
    "\n",
    "plt.subplots_adjust(hspace=0.7, wspace=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data encoding   <a class=\"anchor\" id=\"encode_data\">\n",
    "We apply a preprocessing encoder for the categorial features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encode the categorical features as numbers\n",
    "def number_encode_features(df):\n",
    "    result = df.copy()\n",
    "    encoders = {}\n",
    "    for column in result.columns:\n",
    "        if result.dtypes[column] == np.object:\n",
    "            encoders[column] = preprocessing.LabelEncoder()\n",
    "            #  print('Column:', column, result[column])\n",
    "            result[column] = encoders[column].fit_transform(result[column].fillna('None'))\n",
    "    return result, encoders\n",
    "\n",
    "# Calculate the correlation and plot it\n",
    "encoded_data, _ = number_encode_features(original_data)\n",
    "training_data_matrix = np.array(encoded_data.values, dtype=float)\n",
    "encoded_data, _ = number_encode_features(original_test)\n",
    "test_data_matrix = np.array(encoded_data.fillna(0).values, dtype=float)\n",
    "\n",
    "scaler = preprocessing.MinMaxScaler(feature_range=(0.0, 1.0))\n",
    "training_data_matrix = scaler.fit_transform(training_data_matrix)\n",
    "test_data_matrix = scaler.transform(test_data_matrix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data conversion and upload of the training data    <a class=\"anchor\" id=\"upload_data\">\n",
    "\n",
    "Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the Amazon SageMaker implementation of Linear Learner takes recordIO-wrapped protobuf, where the data we have today is a pickle-ized numpy array on disk.\n",
    "\n",
    "We also need to upload it to S3, so that Amazon SageMaker training can use it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import numpy as np\n",
    "import sagemaker.amazon.common as smac\n",
    "import boto3\n",
    "import os\n",
    "\n",
    "vectors = np.array([t.tolist() for t in training_data_matrix[:,:-1]]).astype('float32')\n",
    "labels = np.where(np.array([t.tolist() for t in training_data_matrix[:,-1]]) == 1, 1, 0).astype('float32')\n",
    "\n",
    "buf = io.BytesIO()\n",
    "smac.write_numpy_to_dense_tensor(buf, vectors, labels)\n",
    "buf.seek(0)\n",
    "\n",
    "key = 'recordio-pb-data'\n",
    "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)\n",
    "s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)\n",
    "print('uploaded training data location: {}'.format(s3_train_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also setup an output S3 location for the model artifact that will be output as the result of training with the algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n",
    "print('training artifacts will be uploaded to: {}'.format(output_location))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Standard linear model  <a class=\"anchor\" id=\"train_linear_model\">\n",
    "\n",
    "Once we have the data preprocessed and available in the correct format for training, the next step is to actually train the model using the data. More details on algorithm containers can be found in [AWS documentation](https://docs-aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
    "import sagemaker\n",
    "\n",
    "container = get_image_uri(boto3.Session().region_name, 'linear-learner', \"latest\")\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "linear = sagemaker.estimator.Estimator(container,\n",
    "                                       role, \n",
    "                                       train_instance_count=1, \n",
    "                                       train_instance_type='ml.c4.xlarge',\n",
    "                                       output_path=output_location,\n",
    "                                       sagemaker_session=sess)\n",
    "linear.set_hyperparameters(feature_dim=14,\n",
    "                           predictor_type='binary_classifier',\n",
    "                           mini_batch_size=200)\n",
    "\n",
    "linear.fit({'train': s3_train_data})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accuracy and Fairness of the model <a class=\"anchor\" id=\"performance_linear_model\">\n",
    "Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint.  This will allow out to make predictions (or inference) from the model dyanamically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.predictor import csv_serializer, json_deserializer\n",
    "linear_predictor = linear.deploy(initial_instance_count=1,\n",
    "                                 instance_type='ml.m4.xlarge')\n",
    "\n",
    "\n",
    "linear_predictor.content_type = 'text/csv'\n",
    "linear_predictor.serializer = csv_serializer\n",
    "linear_predictor.deserializer = json_deserializer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Prediction for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "predictions = []\n",
    "distance_from_hyperplane = []\n",
    "for array in np.array_split(test_data_matrix[:,:-1], 100):\n",
    "    result = linear_predictor.predict(array)\n",
    "    predictions += [r['predicted_label'] for r in result['predictions']]\n",
    "    distance_from_hyperplane += [r['score'] for r in result['predictions']]\n",
    "\n",
    "predictions_test = np.array(predictions)\n",
    "distance_from_hyperplane_test = np.array(distance_from_hyperplane)    \n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "pd.crosstab(np.where(test_data_matrix[:,-1] == 1, 1, 0), predictions_test,\n",
    "            rownames=['actuals'], colnames=['predictions'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We selcted as sensitive feature the gender, dividing the data between \"female\" and \"male\". Let's check if our model is fair (in the sense of Equal Opportunity). In the following the performance concerning Accuracy, True Positive Rate (among the two groups and the whole dataset) and DEO."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sensitive_feature_gender = 9  # Gender\n",
    "groupA_value = 0.0 # Female\n",
    "\n",
    "deo = deo_from_model(linear_predictor, test_data_matrix,\n",
    "                     sensitive_feature_gender, groupA_value=groupA_value)\n",
    "print('DEO: %f' % deo)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A DEO of about $0.24$ is quite high, this means that there is a large gap between the True Positive Rate of the group A (female) and the group B (male) and cosequently there is not Equal Opportunity. \n",
    "\n",
    "In other words, we can say that for every $100$ correctly predicted positive examples in one group, the disadvantage group will have only $100 - 24 = 76$  of them. Consequently, it is clear that the distribution of the error is unfair, and our goal now is to solve this unfairness maintaining a comparable prediction accuracy.\n",
    "\n",
    "## Changing the data to impose fairness <a class=\"anchor\" id=\"impose_fairness\">\n",
    "\n",
    "In the case of linear model, there is a simple but powerful way to impose the fairness constraint$^{[3]}$, i.e. $DEO(f) = 0$. Firstly, we have to introduce the following vector $u$:\n",
    "\n",
    "$$ u = \\frac{1}{n(+, A)} \\sum_{x \\in X(+, A)} x - \\frac{1}{n(+, B)} \\sum_{x \\in X(+, B)} x,$$\n",
    "\n",
    "where $X(+, A)$ and $X(+, B)$ are the sets of positively labeled examples in group $A$ and $B$ and $n(+, A)$ and $n(+, B)$ their cardinalities. The vector $u$ represents the unfair model, and our goal is to impose a serach space of the models that is orthogonal to it. In the implementation we consider, without loss of generality, that the binary sensitive feature is $0$ for the gorup A and $+1$ for group B.\n",
    "\n",
    "In our linear case, we can impose this orthogonal constraint, i.e. impose the fairness, by applying a preprocessing of the original data as following:\n",
    "\n",
    "$$ \\hat{x}_j = x_j - x_i \\frac{u_j}{u_i} \\,\\,\\,\\, j \\in \\{1, \\dots, i-1, i+1, \\dots, d\\}.$$\n",
    "Where $i$ is the index of the sensitive feature and $\\hat{x}_j$ is the new value for the feature $j^{th}$. It is important to note that - as consequence of this method - the new data representation has a $1$ feature less compared the original one (in our case the sensitive feature).\n",
    "\n",
    "$^{[3]}$Michele Donini, et al. \"[Empirical Risk Minimization under Fairness Constraints](https://arxiv.org/abs/1802.08626).\" arXiv preprint arXiv:1802.08626 (2018)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class UncorrelationMethod:\n",
    "    def __init__(self, sensitive_feature, groupA_value=0.0):\n",
    "        self.val0 = groupA_value\n",
    "        self.sensitive_feature = sensitive_feature\n",
    "        self.u = None\n",
    "\n",
    "    def new_representation(self, examples):\n",
    "        if self.u is None:\n",
    "            print('You have to fit the model first...')\n",
    "            return examples\n",
    "        new_examples = np.array([ex if ex[self.sensitive_feature] == self.val0 # case x_i = 0, Group A\n",
    "                                 else ex + self.u for ex in examples]) # case x_i = 1, Group B\n",
    "        new_examples = np.delete(new_examples, self.sensitive_feature, 1)\n",
    "        return new_examples\n",
    "    \n",
    "    def fit(self, dataset):\n",
    "        tmp = [ex for idx, ex in enumerate(dataset)\n",
    "               if dataset[idx, -1] == 1 and ex[self.sensitive_feature] == self.val0]\n",
    "        average_A_1 = np.mean(tmp, 0)\n",
    "        n_A_1 = len(tmp)\n",
    "        tmp = [ex for idx, ex in enumerate(dataset)\n",
    "               if dataset[idx, -1] == 1 and ex[self.sensitive_feature] != self.val0]\n",
    "        average_not_A_1 = np.mean(tmp, 0)\n",
    "        n_not_A_1 = len(tmp)\n",
    "        N_1 = len([ex for idx, ex in enumerate(dataset) if dataset[idx, -1] == 1])\n",
    "        self.u = average_A_1[:-1] - average_not_A_1[:-1]\n",
    "        # Our hypothesis of values 0 (A) and +1 (B) for the sensitive feature among the two groups\n",
    "        # has the following consequence:\n",
    "        self.u[self.sensitive_feature] = -1.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this point we are ready to apply this algorithm to our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "uncorr_data = UncorrelationMethod(sensitive_feature_gender, 0.0)\n",
    "uncorr_data.fit(training_data_matrix) \n",
    "new_training_data_matrix = np.hstack([uncorr_data.new_representation(training_data_matrix[:, :-1]),\n",
    "                                     training_data_matrix[:, -1:-2:-1]])\n",
    "new_test_data_matrix = np.hstack([uncorr_data.new_representation(test_data_matrix[:, :-1]),\n",
    "                                  test_data_matrix[:, -1:-2:-1]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train the model with the fair data <a class=\"anchor\" id=\"train_fair_model\">\n",
    "\n",
    "Now we have simply to repeat the training by using this new dataset. It is important to note that the new dataset has one feature less than the original one ($13$ instead of $14$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vectors = np.array([t.tolist() for t in new_training_data_matrix[:,:-1]]).astype('float32')\n",
    "labels = np.where(np.array([t.tolist() for t in new_training_data_matrix[:,-1]]) == 1, 1, 0).astype('float32')\n",
    "\n",
    "buf = io.BytesIO()\n",
    "smac.write_numpy_to_dense_tensor(buf, vectors, labels)\n",
    "buf.seek(0)\n",
    "\n",
    "key = 'recordio-pb-data'\n",
    "boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)\n",
    "s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)\n",
    "print('uploaded training data location: {}'.format(s3_train_data))\n",
    "\n",
    "output_location = 's3://{}/{}/output'.format(bucket, prefix)\n",
    "print('training artifacts will be uploaded to: {}'.format(output_location))\n",
    "\n",
    "linearf = sagemaker.estimator.Estimator(container,\n",
    "                                       role, \n",
    "                                       train_instance_count=1, \n",
    "                                       train_instance_type='ml.c4.xlarge',\n",
    "                                       output_path=output_location,\n",
    "                                       sagemaker_session=sess)\n",
    "linearf.set_hyperparameters(feature_dim=13,\n",
    "                           predictor_type='binary_classifier',\n",
    "                           mini_batch_size=200)\n",
    "\n",
    "path_fair_model = linearf.fit({'train': s3_train_data})\n",
    "\n",
    "linear_predictorf = linearf.deploy(initial_instance_count=1,\n",
    "                                 instance_type='ml.m4.xlarge')\n",
    "linear_predictorf.content_type = 'text/csv'\n",
    "linear_predictorf.serializer = csv_serializer\n",
    "linear_predictorf.deserializer = json_deserializer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can calculate the predictions using our fair linear model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = []\n",
    "distance_from_hyperplane = []\n",
    "for array in np.array_split(new_test_data_matrix[:,:-1], 100):\n",
    "    result = linear_predictorf.predict(array)\n",
    "    predictions += [r['predicted_label'] for r in result['predictions']]\n",
    "    distance_from_hyperplane += [r['score'] for r in result['predictions']]\n",
    "\n",
    "distance_from_hyperplane_test_fair = np.array(distance_from_hyperplane)    \n",
    "predictions_test_fair = np.array(predictions)\n",
    "pd.crosstab(np.where(new_test_data_matrix[:,-1] == 1, 1, 0), predictions_test_fair,\n",
    "            rownames=['actuals'], colnames=['predictions'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accuracy and Fairness of the fair model <a class=\"anchor\" id=\"performance_fair_model\">\n",
    "Let's see the performance concerning accuracy and fairness for our new model. We selcted as sensitive feature the Gender, dividing the data between \"female\" and \"male\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "groupA_idxs = [idx for idx, val in enumerate(test_data_matrix)\n",
    "               if val[sensitive_feature_gender] == groupA_value]\n",
    "groupB_idxs = [idx for idx, val in enumerate(test_data_matrix)\n",
    "               if val[sensitive_feature_gender] != groupA_value]\n",
    "deo = deo_from_list(new_test_data_matrix, predictions_test_fair, groupA_idxs, groupB_idxs)\n",
    "print('DEO: %f' % deo)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The comparison with respect to the original model is the following:\n",
    "- Concerning the accuracy the difference is not significative. In fact, from an original $0.8235$ of accuracy, we obtain $0.8086$ (decrease of about $1.8 \\%$).\n",
    "- Concerning the DEO, the original model has a level of unfairness of $0.2404$ and our fair model of $0.0612$, with a decrese of more than $75 \\%$.\n",
    "\n",
    "## Sanity check: performance on the training set  <a class=\"anchor\" id=\"performance_fair_model_train\">\n",
    "\n",
    "Let's see the performance of our method on the training set in order to see if we apply the correct constraint and we do not overfit the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = []\n",
    "distance_from_hyperplane = []\n",
    "for array in np.array_split(new_training_data_matrix[:,:-1], 100):\n",
    "    result = linear_predictorf.predict(array)\n",
    "    predictions += [r['predicted_label'] for r in result['predictions']]\n",
    "    distance_from_hyperplane += [r['score'] for r in result['predictions']]\n",
    "\n",
    "distance_from_hyperplane_train_fair = np.array(distance_from_hyperplane)    \n",
    "predictions_train_fair = np.array(predictions)\n",
    "pd.crosstab(np.where(new_training_data_matrix[:,-1] == 1, 1, 0),\n",
    "            predictions_train_fair, rownames=['actuals'], colnames=['predictions'])\n",
    "\n",
    "groupA_idxs = [idx for idx, val in enumerate(training_data_matrix)\n",
    "               if val[sensitive_feature_gender] == groupA_value]\n",
    "groupB_idxs = [idx for idx, val in enumerate(training_data_matrix)\n",
    "               if val[sensitive_feature_gender] != groupA_value]\n",
    "\n",
    "deo = deo_from_list(new_training_data_matrix, predictions_train_fair, groupA_idxs, groupB_idxs)\n",
    "print('DEO: %f' % deo)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The value of the DEO is 0.0097. This confirm that our method is able to implement the fairness constraint in the linear case.\n",
    "\n",
    "## Distribution of the outputs <a class=\"anchor\" id=\"distrib\">\n",
    "\n",
    "Now we plot the values of $\\langle w,x \\rangle - b$ for all the examples $x$ with $y=1$, where $w$ is the trained model, for both the original model and our fair one. The value $\\langle w,x \\rangle - b$ can be considered as the distance of $x$ from the hyperplane that divides our feature space between positive and negative examples. When this value is positive means that our classifier predicts a positive label for the example $x$. Consequently, the area of the histogram is a visualization of the True Positive Rate (the difference between the blue and orange areas is an approximation of the DEO). For this reason, similar blue and orange areas measn a more fair model (with respect to the senstive feature \"gender\")."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "sensitive_feature = sensitive_feature_gender\n",
    "\n",
    "SMALL_SIZE = 12\n",
    "MEDIUM_SIZE = 12\n",
    "BIGGER_SIZE = 16\n",
    "bins = 30\n",
    "\n",
    "X = training_data_matrix[:, :-1]\n",
    "y = training_data_matrix[:, -1]\n",
    "Xte = test_data_matrix[:, :-1]\n",
    "yte = test_data_matrix[:, -1]\n",
    "ypos = np.max(y)\n",
    "yneg = np.min(y)\n",
    "idx_group_A1 = [idx for idx, v in enumerate(Xte) \n",
    "                if v[sensitive_feature] == groupA_value and yte[idx] == ypos]\n",
    "idx_group_B1 = [idx for idx, v in enumerate(Xte) \n",
    "                if v[sensitive_feature] != groupA_value and yte[idx] == ypos]\n",
    "\n",
    "\n",
    "titles = ['Adult Dataset - TPR Area - Linear',\n",
    "         'Adult Dataset - TPR Area - Fair Linear']\n",
    "for i,distance_from_hyperplane in enumerate([distance_from_hyperplane_test,\n",
    "                                           distance_from_hyperplane_test_fair]):\n",
    "    distance_from_hyperplane = distance_from_hyperplane - 0.5\n",
    "    xmin = np.min([np.min(distance_from_hyperplane[idx_group_A1]),\n",
    "                   np.min(distance_from_hyperplane[idx_group_B1])])\n",
    "    xmax = np.max([np.max(distance_from_hyperplane[idx_group_A1]),\n",
    "                   np.max(distance_from_hyperplane[idx_group_B1])])\n",
    "    fig, ax = plt.subplots(figsize=(8, 6), dpi=90)\n",
    "    plt.rc('font', size=SMALL_SIZE)          # controls default text sizes\n",
    "    plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title\n",
    "    plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels\n",
    "    plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels\n",
    "    plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels\n",
    "    plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize\n",
    "    #plt.rc('figure', titlesize=MEDIUM_SIZE)  # fontsize of the figure title\n",
    "    pdf, bins, patches = ax.hist(distance_from_hyperplane[idx_group_A1], bins=bins,\n",
    "                                 density=True, stacked=True, label='A=Female, Y=1', alpha=1.0)\n",
    "    ax.hist(distance_from_hyperplane[idx_group_B1], bins=bins, density=True,\n",
    "            stacked=True, label='B=Male, Y=1', alpha=0.5)\n",
    "    ax.legend(loc='upper left')\n",
    "    ax.set_xlim(left=0.0, right=xmax)\n",
    "    ax.set_ylim(0, 3)\n",
    "    plt.title(titles[i])\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### (Optional) Delete the Endpoint\n",
    "\n",
    "If you're ready to be done with this notebook, please run the delete_endpoint line in the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sagemaker.Session().delete_endpoint(linear_predictor.endpoint)\n",
    "sagemaker.Session().delete_endpoint(linear_predictorf.endpoint)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \\\"License\\\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \\\"license\\\" file accompanying this file. This file is distributed on an \\\"AS IS\\\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
